Releases: Mozilla-Ocho/llamafile
llamafile v0.8.11
- 7469a23 Add smaug-bpe tokenizer
llamafile v0.8.10
llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and runs as a single portable binary on the stock installs of six OSes, with no installation required. It combines the best of llama.cpp and Cosmopolitan Libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI-API-compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.
This release includes a build of the new llamafile server rewrite we've been promising, which we're calling llamafiler. It has matured enough that we can recommend it for serving embeddings, and it's the fastest way to do so. If you use it with all-MiniLM-L6-v2.Q6_K.gguf, then on Threadripper it can serve JSON /embedding requests at 800 req/sec, whereas the old llama.cpp server could only do 100 req/sec. So you can fill up your RAG databases very quickly if you productionize this.
The old llama.cpp server came from a folder named "examples" and was never intended to be production-worthy. This server is designed to be sturdy and uncrashable. It also has /completion and /tokenize endpoints, which serve 3.7 million requests per second on Threadripper, thanks to Cosmo Libc improvements.
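If you want to poke at the new server from the command line, here's a minimal sketch. The model path is only an example, and the prompt query parameter is assumed to work the same way as the /embedding example in the v0.8.9 notes below; see the LLaMAfiler documentation for the authoritative parameters of each endpoint.
# build and start llamafiler (the model path here is just an example)
make -j o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.Q6_K.gguf
# from another terminal: tokenize a prompt (parameter name assumed; see the docs)
curl 'http://127.0.0.1:8080/tokenize?prompt=hello+world'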
See the LLaMAfiler Documentation for further details.
- 73b1836 Write documentation for new server
- b3930aa Make GGML asynchronously cancelable
- 8604e9a Fix POSIX undefined cancelation behavior
- 323f50a Let SIGQUIT produce per-thread backtraces
- 15d7fba Use semaphore to limit GGML worker threads
- d7c8e33 Add support for JSON parameters to new server
- 7f099cd Make stack overflows recoverable in new server
- fb3421c Add barebones /completion endpoint to new server
This release restores support for non-AVX x86 microprocessors. We had to drop support at the beginning of the year; however, our CPUID dispatching has advanced considerably since then. We're now able to offer top speeds on modern hardware without leaving old hardware behind.
Here are the remaining improvements included in this release:
llamafile v0.8.9
This release gets Gemma2 working closer to how Google intended.
- af22695 Make gemma2-27b-it the same as aistudio.google.com
- 41678c8 Add sliding window mask for Gemma2
- 140eed5 Add soft-capping to Gemma2
This release fixes Android support. You can now run LLMs on your phone using Cosmopolitan software like llamafile. Thank you @aj47 (techfren.net) for the bug reports and testing efforts. See also the other bug fixes described in the Cosmopolitan v3.5.4 and v3.5.3 release notes.
Our future replacement for the server now has an /embedding endpoint. On
my workstation, it's currently able to serve 851 requests per second for
a prompt with 52 tokens, using the all-MiniLM-L6-v2.Q6_K.gguf embeddings
model. None of the requests fail and 99th percentile latency is 56.74ms.
- 1346ef4 Create /embedding endpoint in new server
- 263d39b Use float to string conversion
- 0d62d05 Reclaim llama_decode() memory on cancelation
- 617d841 Remove ggml_context cache
- 46dda4f Refactor new server and get leak checker working
- cd73243 Prevent vector overflow in llama.cpp
You can try the new embedding server as follows:
make -j o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.F32.gguf
curl http://127.0.0.1:8080/embedding?prompt=orange
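The endpoint returns JSON, so it composes nicely with ordinary command-line tooling. For instance, a quick way to inspect the response without assuming anything about its field names (python3 is used here only as a JSON pretty-printer):
curl -s 'http://127.0.0.1:8080/embedding?prompt=orange' | python3 -m json.tool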
Compatibility with the old server's API of posting JSON content will be added in upcoming changes. The same goes for the OpenAI API. The goal is to be compatible with everything.
llamafile v0.8.8
llamafile v0.8.7
This release includes important performance enhancements for quants.
- 293a528 Performance improvements on Arm for legacy and k-quants (#453)
- c38feb4 Optimized matrix multiplications for i-quants on __aarch64__ (#464)
This release fixes bugs. For example, we're now using a brand new memory
manager, which is believed to support platforms like Android that have a
virtual address space with fewer than 47 bits. This release also restores our
prebuilt Windows AMD GPU support, thanks to tinyBLAS.
- 0c0e72a Upgrade to Cosmopolitan v3.5.1
- 629e208 Fix server crash due to /dev/urandom
- 60404a8 Always use tinyBLAS with AMD GPUs on Windows
- 6d3590c Pacify --temp flag when running in server mode
- a28250b Update GGML_HIP_UMA (#473)
- e973fa2 Improve CPU brand detection
- 9cd8d70 Update server README build/testing instructions (#461)
It should be noted that, in future releases, we plan to introduce a new server for llamafile. This new server is being designed for performance and production-worthiness. It's not included in this release, since it currently only supports a tokenization endpoint. However, that endpoint is capable of doing 2 million requests per second, whereas with the current server the most we've ever seen is a few thousand.
- e0656ea Introduce new llamafile server
llamafile v0.8.6
Two minor issues are fixed with this release.
- 69c2dd3 Don't print special tokens for now (improve shell scriptability)
- 866a129 Upgrade to Cosmopolitan v3.3.8
See the llamafile v0.8.5 release notes for further details. For driver-only prebuilt AMD GPU support on Windows, please use llamafile v0.8.4 for the next few weeks, until ggml-org/llama.cpp#7156 is resolved.
llamafile v0.8.5
This release fixes bugs and introduces @Kawrakow's latest quant performance enhancements (a feature exclusive to llamafile). As of #435, the K quants now go consistently 2x faster than llama.cpp upstream. On big CPUs like Threadripper we've doubled the performance of tiny models, for both prompt processing and token generation (see the benchmarks below). The llamafile-bench and llamafile-upgrade-engine commands have been introduced.
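As a quick sketch of the new benchmark command: the model filename below is only an example, and the -m flag is assumed to select the model in the style of llama.cpp's llama-bench, so check the command's help output for the authoritative options. The llamafile-upgrade-engine usage is shown further down in these notes.
# benchmark prompt processing and token generation for one model (CPU mode only)
llamafile-bench -m TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf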
- a86e7ce Add Script To Upgrade llamafile Archives (#412)
- 07e87bf 261dfe7 Fix llamafile-quantize and rewrite documentation
- 938cf72 Faster AVX2 matrix multiplications for MoE models (#428)
- eaa756d Faster AVX2 matrix multiplications for legacy quants (#405)
- 7cb15c6 Another performance optimization for Zen4 + refactoring (#435)
- 9206719 8b2f8d8 e675719 4451c6d Introduce llamafile-bench command (cpu mode only)
- 87d4ce1 Fix f16 cpuid check (caused crashes on sandybridge)
- 5c40565 8d1afe4 Avoid crashing on llava ctrl-c
- c0aa43e Introduce bf16 cuda support
- 00e4f72 Enable GGML_CUDA_FORCE_MMQ in tinyBLAS mode
- d228e01 0b5997d 64fbffc Sync with llama.cpp upstream (#427)
- c660d38 Add text embedding models to 'other example llamafiles' table (#422)
- 49cc13c Updated README with instructions to load models from third-party apps (#417)
Note: Please use llamafile v0.8.4 if you need prebuilt (driver-only) AMD GPU support on Windows,
at least for the next few weeks, until ggml-org/llama.cpp#7156 is resolved.
Binaries run on Linux, Windows, macOS, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64. GPU support is provided via CUDA, ROCm, and Metal. Prebuilt GPU binaries are provided for CUDA/ROCm on Linux, and CUDA on Windows. To install this release on systems with a POSIX-style shell:
sudo -s
cd /usr/local
wget https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.5/llamafile-0.8.5.zip
unzip llamafile-0.8.5.zip
exit
llamafile --help
To upgrade your old llamafiles without needing to redownload, run:
llamafile-upgrade-engine old.llamafile new.llamafile
Prebuilt llamafiles that have the LLM weights included are available at:
- https://huggingface.co/Mozilla (official)
- https://huggingface.co/models?library=llamafile (community)
Here are some tutorials:
- https://justine.lol/oneliners/
- https://github.com/mozilla-ocho/llamafile/
- https://future.mozilla.org/news/llamafiles-for-embeddings-in-local-rag-applications/
- https://blog.mozilla.ai/local-llm-as-judge-evaluation-with-lm-buddy-prometheus-and-llamafile/
- https://www.docker.com/blog/a-quick-guide-to-containerizing-llamafile-with-docker-for-ai-applications/
Here are some performance benchmarks for various quantization formats on the world's flagship CPUs. See https://justine.lol/matmul/ to compare these numbers to where we were back in March, two months ago.
cpu_info | model_filename | size | test | t/s |
---|---|---|---|---|
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.BF16 | 86.99 GiB | pp512 | 447.01 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.BF16 | 86.99 GiB | tg16 | 11.35 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.F16 | 86.99 GiB | pp512 | 340.63 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.F16 | 86.99 GiB | tg16 | 11.01 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q8_0 | 46.22 GiB | pp512 | 288.16 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q8_0 | 46.22 GiB | tg16 | 15.82 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q6_K | 35.74 GiB | pp512 | 431.51 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q6_K | 35.74 GiB | tg16 | 22.73 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q5_K_M | 30.95 GiB | pp512 | 427.71 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q5_K_M | 30.95 GiB | tg16 | 24.90 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_K_M | 26.49 GiB | pp512 | 440.03 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_K_M | 26.49 GiB | tg16 | 27.31 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_0 | 24.63 GiB | pp512 | 287.51 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_0 | 24.63 GiB | tg16 | 18.92 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_M | 21.00 GiB | pp512 | 433.89 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_M | 21.00 GiB | tg16 | 30.30 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_S | 19.03 GiB | pp512 | 432.36 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_S | 19.03 GiB | tg16 | 31.34 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q2_K | 16.12 GiB | pp512 | 449.64 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q2_K | 16.12 GiB | tg16 | 33.71 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F32 | 4.10 GiB | pp512 | 2103.25 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F32 | 4.10 GiB | tg16 | 57.34 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.BF16 | 2.05 GiB | pp512 | 2603.84 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.BF16 | 2.05 GiB | tg16 | 77.18 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | pp512 | 2038.64 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | tg16 | 80.23 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 2203.77 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 100.78 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 2838.05 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 135.27 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_1 | 791.50 MiB | pp512 | 2328.06 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_1 | 791.50 MiB | tg16 | 138.15 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 2676.14 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 143.58 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 2281.44 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 145.02 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_S | 729.84 MiB | pp512 | 2757.59 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_S | 729.84 MiB | tg16 | 143.59 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_1 | 668.18 MiB | pp512 | 2444.11 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_1 | 668.18 MiB | tg16 | 148.50 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_M | 636.18 MiB | pp512 | 2758.90 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_M | 636.18 MiB | tg16 | 149.92 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_S | 609.53 MiB | pp512 | 2847.95 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_S | 609.53 MiB | tg16 | 150.84 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | pp512 | 2420.58 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | tg16 | 154.27 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_L | 563.42 MiB | pp512 | 2743.74 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_L | 563.42 MiB | tg16 | 155.29 |
AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_M | 522.30 MiB... |
llamafile v0.8.4
This release fixes underflows and overflows.
- A memory bug in the grammar parser has been fixed that caused commands like ./llamafile -m foo.gguf -p bar --grammar 'root::="' (which failed to specify a closing quote) to crash. Anyone using the server as a public-facing endpoint (despite our previous recommendations) is strongly encouraged to upgrade. See 22aba95 and 3fe045f. Credit for discovering (and, most importantly, reporting) this issue goes to Eclypsium security researcher Richard Johnson. We incorrectly reported earlier that this fix was incorporated into the v0.8.2 release; you need to use the v0.8.4 release. This bug fix was upstreamed in ggml-org/llama.cpp#7194. A correctly closed grammar is sketched after this list.
- Our new vectorized expf() implementation now handles underflow by producing subnormals rather than flushing to zero. b5c6df6
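For reference, here's what a correctly closed grammar invocation might look like (a minimal sketch: the model name is a placeholder, and the grammar simply constrains output to the literal string "bar"):
./llamafile -m foo.gguf -p bar --grammar 'root ::= "bar"'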
See these instructions for how to put the latest llamafile software into your old weights, without having to redownload. #24 (comment)
llamafile v0.8.2
llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and runs as a single portable binary on the stock installs of six OSes, with no installation required. It combines the best of llama.cpp and Cosmopolitan Libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI-API-compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.
- This release introduces faster AVX2 prompt processing for K-quants and IQ4_XS (#394). This was contributed to llamafile by @ikawrakow, who originally invented K quants last year: ggml-org/llama.cpp@99009e7. In prior releases we recommended the legacy Q4_0 quant, since it was the simplest and most intuitive to get working with recent matmul optimizations. Thanks to Iwan Kawrakow's efforts, the best quants (e.g. Q5_K_M) will now go the fastest (on modern x86 systems).
- Text generation (or prediction) should now go slightly faster too, thanks to development work on the matmul kernels and enhancements to thread synchronization (see 89c189e), which should be noticed most on CPUs with many cores running smaller models. macOS ARM users who are using CPU rather than Metal can expect to see the biggest boost, now that llamafile knows how to utilize all cores (see 6c45e3e).
- Bugs in the server /embedding endpoint have been fixed (see 0e2845a and 7900294). You can also now pass llamafile --embedding -m model -p prompt to have embeddings printed to standard output (see 42bd9b8).
- This release synchronizes with the upstream llama.cpp project as of May 7th in 94d0940, which improves tokenization for Command-R, Refact, Olmo, and StarCoder. There's a new flash attention op that may be enabled for many models by passing the -fa flag. We haven't been able to include this in our prebuilt CUDA/ROCm binaries yet, so you may need to pass the llamafile --recompile flag for GPU.
- This release introduces the --precise, --fast, and --trap flags, which control how math is executed. The --precise flag can slightly enhance the thinking of LLMs at the cost of some performance (see 2af3b88 and 9540b43). The --fast flag is included since it's unspecified which mode llamafile will use for any given situation (see bbae0f6 and b749326). The --trap flag can help you pinpoint the exact moment any NaNs appear (on CPUs that support this, e.g. most of x86), which is useful for troubleshooting. Additionally, a new vectorized expf() function has been introduced that enables llamafile to compute the exponent function faster and at full quality (see e2b3cb2). This matters because it's the function that powers SiLU and SoftMax, which are used by most of today's premier public models. A command-line sketch of these flags, along with the --embedding and -fa usage above, follows this list.
- Most of the CPU code in the GGML library now has optimal performance across different hardware architectures, thanks to new build system techniques. Features, options, or models that underperformed before may do better now (see 0bdea60 and c9d7393).
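Here's a rough sketch of how the flags mentioned above fit together on the command line. The model filenames and prompts are placeholders, and whether you need --recompile for -fa depends on your GPU setup as described above:
# print embeddings for a prompt to standard output
llamafile --embedding -m all-MiniLM-L6-v2.Q6_K.gguf -p 'orange'
# try the new flash attention op (pass --recompile if your prebuilt GPU module lacks it)
llamafile -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -fa -p 'Why is the sky blue?'
# trade some speed for accuracy, or trap the first NaN while troubleshooting
llamafile -m mistral-7b-instruct-v0.2.Q4_K_M.gguf --precise -p 'Why is the sky blue?'
llamafile -m mistral-7b-instruct-v0.2.Q4_K_M.gguf --trap -p 'Why is the sky blue?'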
Additional fixes:
- a2d159e Fix server multimodal statistics (#392)
- aa8c01a Revert moondream vision language model support
- eecbf89 More conservative strong/em markdown matcher (#352)
- 38311f2 CUDA: CUDART < 11.7 workaround for __hmax, __hmax2
- 58d2ca0 Use qsort and set linkage to static for internal functions used for offload-arch-fix (#375)
- 4ee1e39 The PDF documentation in llamafile-0.8.2.zip is now fixed
- 4ee1e39 Remove warnings from cuda build
Additional notes:
- We're experiencing some instability with our Windows AMD GPU support. If you encounter crashes using the -ngl 999 flag on Windows, then try using the previous 0.8.1 release. Please consider filing an issue to report if it doesn't work, or better yet, file an issue if it does work, since we otherwise have no way of knowing that (llamafile doesn't have telemetry, because maximally respecting the user's privacy on their local machine is one of the project's stated goals). You can also share details about your experience with us on the Mozilla AI Discord server.
See these instructions for how to put the latest llamafile software into your old weights, without having to redownload. #24 (comment)
llamafile v0.8.1
- Support for Phi-3 Mini 4k has been introduced
- A bug causing GPU module crashes on some systems has been resolved
- Support for Command-R Plus has now been vetted with proper 64-bit indexing
- We now support more AMD GPU architectures thanks to better detection of offload archs (#368)
- We now ship prebuilt NVIDIA and ROCm modules for both Windows and Linux users. They link tinyBLAS, which is a libre math library that only depends on the graphics driver being installed. Since it's slower, llamafile will automatically build a native module for your system if the CUDA or ROCm SDKs are installed. You can control this behavior using --nocompile or --recompile (a short command-line sketch follows this list). Yes, our LLaVA llamafile still manages to squeak under the Windows 4GB file size limit!
- An assertion error has been fixed that happened when using llamafile-quantize to create K quants from an F32 GGUF file
- A new llamafile-tokenize command line tool has been introduced. For example, if you want to count how many "tokens" are in a text file, you can say cat file.txt | llamafile-tokenize -m model.llamafile | wc -l since it prints each token on a single line.
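As a small sketch of the --nocompile behavior mentioned above (the model filename is a placeholder, and -ngl 999 asks llamafile to offload as many layers as possible to the GPU):
# use the prebuilt tinyBLAS GPU module even if the CUDA or ROCm SDK is installed
llamafile -m model.llamafile -ngl 999 --nocompile -p 'Why is the sky blue?'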